A Parallel MapReduce Algorithm to Efficiently Support Itemset Mining on High Dimensional Data

Authors

  • Daniele Apiletti
  • Elena Baralis
  • Tania Cerquitelli
  • Paolo Garza
  • Fabio Pulvirenti
  • Pietro Michiardi
Abstract

In today’s world, large volumes of data are continuously generated by many scientific applications, such as bioinformatics or networking. Since each monitored event is usually characterized by a variety of features, high-dimensional datasets have been continuously generated. To extract value from these complex collections of data, different exploratory data mining algorithms can be used to discover hidden and non-trivial correlations among data. Frequent closed itemset mining is an effective but computationally expensive technique that is usually used to support data exploration. Thanks to the spread of distributed and parallel frameworks, the development of scalable approaches able to deal with so-called Big Data has been extended to frequent itemset mining. Unfortunately, most of the current algorithms are designed to cope with low-dimensional datasets, delivering poor performance in use cases characterized by high-dimensional data. This work introduces PaMPa-HD, a MapReduce-based frequent closed itemset mining algorithm for high-dimensional datasets. An efficient solution has been proposed to parallelize and speed up the mining process. Furthermore, different strategies have been proposed to easily configure the algorithm parameter. The experimental results, obtained on real-life high-dimensional use cases, show the efficiency of the proposed approach in terms of execution time, load balancing and robustness to memory issues.

Preprint submitted to Big Data Research, July 16, 2017.
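As a rough illustration of the term used in the abstract, an itemset is frequent when at least a minimum number of transactions contain it, and closed when no strict superset has the same support. The sketch below is a brute-force enumeration on a toy dataset, not PaMPa-HD and not a MapReduce job; the function names, the toy transactions and the threshold are all assumptions made for this example.

```python
from itertools import combinations


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_closed_itemsets(transactions, min_support):
    """Brute-force enumeration; only practical for a handful of items."""
    items = sorted(set().union(*transactions))

    # Step 1: collect every frequent itemset with its support count.
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = support(candidate, transactions)
            if count >= min_support:
                frequent[candidate] = count

    # Step 2: keep an itemset only if no strict superset has equal support.
    # Such a superset would itself be frequent, so checking `frequent` suffices.
    closed = {}
    for itemset, count in frequent.items():
        absorbed = any(
            itemset < other and count == other_count
            for other, other_count in frequent.items()
        )
        if not absorbed:
            closed[itemset] = count
    return closed


# Hypothetical toy dataset: four transactions over the items a, b, c.
toy_transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}]
toy_closed = frequent_closed_itemsets(toy_transactions, min_support=2)
```

The enumeration visits every subset of the item universe, so its cost grows exponentially with the number of distinct items. That growth is the reason high-dimensional data calls for specialized approaches such as the one the abstract describes.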

Similar resources

Weighted Itemset Mining from Bigdata using Hadoop

Data items have been extracted using an empirical data mining technique called frequent itemset mining. In the majority of application contexts, items are enriched with weights. Pushing item weights into the itemset extraction process, i.e., mining weighted itemsets rather than traditional itemsets, is an appealing research direction. Although many efficient weighted itemset mining algorithms a...

A New Algorithm for High Average-utility Itemset Mining

High utility itemset mining (HUIM) is an emerging field in data mining which has gained growing interest due to its various applications. The goal of this problem is to discover all itemsets whose utility exceeds a minimum threshold. The basic HUIM problem does not consider the length of itemsets in its utility measurement, and utility values tend to become higher for itemsets containing more items...

Optimizing the Data-Process Relationship for Fast Mining of Frequent Itemsets in MapReduce

Despite crucial recent advances, the problem of frequent itemset mining is still facing major challenges. This is particularly the case when: i) the mining process must be massively distributed and; ii) the minimum support (MinSup) is very low. In this paper, we study the effectiveness and leverage of specific data placement strategies for improving parallel frequent itemset mining (PFIM) perfo...

Performance Evaluation of Apriori Algorithm on a Hadoop Cluster

Frequent Itemset Mining is a well-known concept in data sciences. If we feed frequent itemset mining algorithms with large datasets, they quickly become resource hungry as their search space explodes. This problem is even more apparent when we try to use them on Big Data. Recent advances in parallel programming provide good solutions to deal with large datasets but they present their own problems w...

A Parallel and Distributed Method to mine Erasable Itemsets from High utility patterns

High utility pattern mining has become a very important research issue in data mining by considering the non-binary frequency values of items in transactions and different profit values for each item. These profit values can be computed efficiently in order to determine the gain of an itemset, which in turn will help in the production planning of any company. This gain value is needed to prune some of t...

Journal:
  • Big Data Research

Volume 10  Issue

Pages  -

Publication date 2017